Systems and methods for paired end sequencing

ABSTRACT

Systems and methods for analyzing overlapping sequence information can obtain first and second overlapping sequence information for a polynucleotide, align the first and second sequence information, determine a degree of agreement between the first and second sequence information for a location along the polynucleotide, and determine a base call and a quality value for the location.

RELATED APPLICATIONS

This application claims priority pursuant to 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 61/640,288, entitled “Systems and Methods for Paired End Sequencing”, filed on Apr. 30, 2012, the entirety of which is incorporated herein by reference as if set forth in full.

FIELD

The present disclosure generally relates to the field of nucleic acid sequencing including systems and methods for paired end sequencing.

INTRODUCTION

Upon completion of the Human Genome Project, one focus of the sequencing industry has shifted to finding higher throughput and/or lower cost nucleic acid sequencing technologies, sometimes referred to as “next generation” sequencing (NGS) technologies. In making sequencing higher throughput and/or less expensive, the goal is to make the technology more accessible. These goals can be reached through the use of sequencing platforms and methods that provide sample preparation for samples of significant complexity, sequencing larger numbers of samples in parallel (for example through use of barcodes and multiplex analysis), and/or processing high volumes of information efficiently and completing the analysis in a timely manner. Various methods, such as, for example, sequencing by synthesis, sequencing by hybridization, and sequencing by ligation are evolving to meet these challenges.

Ultra-high throughput nucleic acid sequencing systems incorporating NGS technologies typically produce a large number of short sequence reads. Sequence processing methods should desirably assemble and/or map a large number of reads quickly and efficiently, such as to minimize use of computational resources. For example, data arising from sequencing of a mammalian genome can result in tens or hundreds of millions of reads that typically need to be assembled before they can be further analyzed to determine their biological, diagnostic and/or therapeutic relevance.

Exemplary applications of NGS technologies include, but are not limited to: genomic resequencing including genomic variant detection, such as insertions/deletions, copy number variations, single nucleotide polymorphisms, etc., gene expression analysis and genomic profiling.

Of particular interest are improved systems and methods for detecting somatic mutations, such as those found in cancerous tumors. For example, identification of a somatic mutation specific to a cancerous tumor and not found in normal tissue can lead to insights into the development of cancer, aid in the discovery of new cancer treatments, or guide the selection of appropriate treatments for a cancer patient.

From the foregoing it will be appreciated that a need exists for systems and methods that can identify somatic mutations using nucleic acid sequencing data.

DRAWINGS

For a more complete understanding of the principles disclosed herein, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram that illustrates an exemplary computer system, in accordance with various embodiments.

FIG. 2 is a schematic diagram of an exemplary system for determining a nucleic acid sequence, in accordance with various embodiments.

FIG. 3 is an exemplary ionogram in accordance with various embodiments.

FIGS. 4 and 5 are illustrations of the relationship between flow space and base space, in accordance with various embodiments.

FIG. 6 is a flow diagram illustrating an exemplary method of analyzing paired end read information, in accordance with various embodiments.

FIG. 7 is a flow diagram illustrating an exemplary method of aligning flow series information for paired end reads, in accordance with various embodiments.

FIG. 8 is an illustration of a flow space alignment, in accordance with various embodiments.

FIG. 9 is a flow diagram illustrating an exemplary method of analyzing paired end read information to identify low frequency variants, in accordance with various embodiments.

FIG. 10 is a schematic diagram of an exemplary genetic analysis system, in accordance with various embodiments.

It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended to bring clarity and understanding to various embodiments of apparatuses, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.

DESCRIPTION OF VARIOUS EMBODIMENTS

Embodiments of systems and methods for detecting low frequency variants are described herein.

The section headings used herein are for organizational purposes only and are not to be construed as limiting the described subject matter in any way.

In this detailed description of the various embodiments, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the embodiments disclosed. One skilled in the art will appreciate, however, that these various embodiments may be practiced with or without these specific details. In other instances, structures and devices are shown in block diagram form. Furthermore, one skilled in the art can readily appreciate that the specific sequences in which methods are presented and performed are illustrative and it is contemplated that the sequences can be varied and still remain within the spirit and scope of the various embodiments disclosed herein.

All literature and similar materials cited in this application, including but not limited to, patents, patent applications, articles, books, treatises, and internet web pages are expressly incorporated by reference in their entirety for any purpose. Unless described otherwise, all technical and scientific terms used herein have a meaning as is commonly understood by one of ordinary skill in the art to which the various embodiments described herein belong.

It will be appreciated that there is an implied “about” prior to the temperatures, concentrations, times, number of bases, coverage, etc. discussed in the present teachings, such that slight and insubstantial deviations are within the scope of the present teachings. In this application, the use of the singular includes the plural unless specifically stated otherwise. Also, the use of “comprise”, “comprises”, “comprising”, “contain”, “contains”, “containing”, “include”, “includes”, and “including” is not intended to be limiting. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present teachings.

As used herein, “a” or “an” also may refer to “at least one” or “one or more.” Also, the use of “or” is inclusive, such that the phrase “A or B” is true when “A” is true, “B” is true, or both “A” and “B” are true.

Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, cell and tissue culture, molecular biology, and protein and oligo- or polynucleotide chemistry and hybridization described herein are those well known and commonly used in the art. Standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein are those well known and commonly used in the art.

A “system” sets forth a set of components, real or abstract, comprising a whole where each component interacts with or is related to at least one other component within the whole.

A “biomolecule” may refer to any molecule that is produced by a biological organism, including large polymeric molecules such as proteins, polysaccharides, lipids, and nucleic acids (DNA and RNA) as well as small molecules such as primary metabolites, secondary metabolites, and other natural products.

The phrase “next generation sequencing” or NGS refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands of relatively small sequence reads at a time. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the Personal Genome Machine (PGM) of Life Technologies Corp. provides massively parallel sequencing with enhanced accuracy. The PGM System and associated workflows, protocols, chemistries, etc. are described in more detail in U.S. Patent Application Publication No. 2009/0127589 and No. 2009/0026082, the entirety of each of these applications being incorporated herein by reference.

The phrase “sequencing run” refers to any step or portion of a sequencing experiment performed to determine some information relating to at least one biomolecule (e.g., nucleic acid molecule).

The phrase “base space” refers to a representation of the sequence of nucleotides. The phrase “flow space” refers to a representation of the incorporation event or non-incorporation event for a particular nucleotide flow. For example, flow space can be a series of zeros and ones representing a nucleotide incorporation event (a one, “1”) or a non-incorporation event (a zero, “0”) for that particular nucleotide flow. It should be understood that zeros and ones are convenient representations of a non-incorporation event and a nucleotide incorporation event; however, any other symbol or designation could be used alternatively to represent and/or identify these events and non-events.

To illustrate the interplay between base-space vectors, flow-space vectors, and nucleotide flow orders, one may consider, for example, an underlying template sequence beginning with “TA” subjected to multiple cycles of a nucleotide flow order of “TACG.” The first flow, “T,” would result in a non-incorporation because it is not complementary to the template's first base, “T.” In the base-space vector, no nucleotide designation would be inserted; in the flow-space vector, a “0” would be inserted, leading to “0.” The second flow, “A,” would result in an incorporation because it is complementary to the template's first base, “T.” In the base-space vector, an “A” would be inserted, leading to “A”; in the flow-space vector, a “1” would be inserted, leading to “01.” The third flow, “C,” would result in a non-incorporation because it is not complementary to the template's second base, “A.” In the base-space vector, no nucleotide designation would be inserted; in the flow-space vector, a “0” would be inserted, leading to “010.” The fourth flow, “G,” would result in a non-incorporation because it is not complementary to the template's second base, “A.” In the base-space vector, no nucleotide designation would be inserted; in the flow-space vector, a “0” would be inserted, leading to “0100.” The fifth flow, “T,” would result in an incorporation because it is complementary to the template's second base, “A.” In the base-space vector, a “T” would be inserted, leading to “AT”; in the flow-space vector, a “1” would be inserted, leading to “01001.” (Note: if the analysis were to contemplate a potentially longer template, an “X” could be inserted here instead because additional “A's” could potentially be present in the template in the case of a longer homopolymer, which would allow for more than one incorporation during the fifth flow, leading to “0100X.”) The base-space vector thus shows only the sequence of incorporated nucleotides, whereas the flow-space vector shows more expressly the incorporation status corresponding to each flow. Whereas a base-space representation may be fixed and remain common for various flow orders, the flow-based representation depends on the particular flow order. Knowing the nucleotide flow order, one can infer either vector from the other. Of course, the base-space vector could be represented using complementary bases rather than the incorporated bases (thus, one could just as well define the base-space representation of a sequencing key as being the incorporated nucleotides or as being the complementary nucleotides of the template against which the flowed nucleotides would be incorporated).
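The walk-through above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name bases_to_flows and its parameters are hypothetical and do not come from the present teachings, and the input is assumed to be the sequence of incorporated nucleotides (here “AT”), not the template.

```python
def bases_to_flows(incorporated, flow_order, n_flows):
    """Convert a base-space sequence of incorporated nucleotides into a
    flow-space vector of per-flow incorporation counts.

    All names here are hypothetical, chosen for illustration only.
    """
    flows = []
    pos = 0  # index of the next base still waiting to be incorporated
    for i in range(n_flows):
        nuc = flow_order[i % len(flow_order)]  # flow order repeats cyclically
        count = 0
        # A single flow incorporates every consecutive matching base
        # (a homopolymer gives a count greater than one).
        while pos < len(incorporated) and incorporated[pos] == nuc:
            count += 1
            pos += 1
        flows.append(count)
    return flows


# Incorporated bases "AT" under flow order "TACG", first five flows.
print(bases_to_flows("AT", "TACG", 5))  # [0, 1, 0, 0, 1]
```

The printed vector corresponds to the flow-space string “01001” built up flow by flow in the paragraph above.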

DNA (deoxyribonucleic acid) is a chain of nucleotides consisting of 4 types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine), and RNA (ribonucleic acid) is comprised of 4 types of nucleotides: A, U (uracil), G, and C. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). That is, adenine (A) pairs with thymine (T) (in the case of RNA, however, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand. As used herein, “nucleic acid sequencing data,” “nucleic acid sequencing information,” “nucleic acid sequence,” “genomic sequence,” “genetic sequence,” “fragment sequence,” or “nucleic acid sequencing read” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA. It should be understood that the present teachings contemplate sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.

A “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Usually oligonucleotides range in size from a few monomeric units, e.g. 3-4, to several hundreds of monomeric units. Whenever a polynucleotide such as an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted. The letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.

As used herein, the phrase “paired end sequencing” or “paired end reads” can refer to sequencing techniques generally known in the art of molecular biology that can allow the determination of multiple “reads” of sequence, each from a different place on a single polynucleotide. In various embodiments, two reads can be in opposite directions along the polynucleotide and can have regions of overlap. Because the overlapping portions provide redundant information for a region of the polynucleotide, the information from paired end reads can be used to correct errors made during sequencing of a read, thereby improving the accuracy of the determined sequence.

Computer-Implemented System

FIG. 1 is a block diagram that illustrates a computer system 100, upon which embodiments of the present teachings may be implemented. In various embodiments, computer system 100 can include a bus 102 or other communication mechanism for communicating information, and a processor 104 coupled with bus 102 for processing information. In various embodiments, computer system 100 can also include a memory 106, which can be a random access memory (RAM) or other dynamic storage device, coupled to bus 102 for storing information used in determining base calls and instructions to be executed by processor 104. Memory 106 also can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 104. In various embodiments, computer system 100 can further include a read only memory (ROM) 108 or other static storage device coupled to bus 102 for storing static information and instructions for processor 104. A storage device 110, such as a magnetic disk or optical disk, can be provided and coupled to bus 102 for storing information and instructions.

In various embodiments, computer system 100 can be coupled via bus 102 to a display 112, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 114, including alphanumeric and other keys, can be coupled to bus 102 for communicating information and command selections to processor 104. Another type of user input device is a cursor control 116, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 104 and for controlling cursor movement on display 112. This input device typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allows the device to specify positions in a plane.

A computer system 100 can perform the present teachings. Consistent with certain implementations of the present teachings, results can be provided by computer system 100 in response to processor 104 executing one or more sequences of one or more instructions contained in memory 106. Such instructions can be read into memory 106 from another computer-readable medium, such as storage device 110. Execution of the sequences of instructions contained in memory 106 can cause processor 104 to perform the processes described herein. Alternatively, hard-wired circuitry can be used in place of or in combination with software instructions to implement the present teachings. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.

The term “computer-readable medium” as used herein refers to any media that participates in providing instructions to processor 104 for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media can include, but are not limited to, optical or magnetic disks, such as storage device 110. Examples of volatile media can include, but are not limited to, dynamic memory, such as memory 106. Examples of transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 102.

Common forms of non-transitory computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.

In accordance with various embodiments, instructions configured to be executed by a processor to perform a method are stored on a computer-readable medium. The computer-readable medium can be a device that stores digital information. For example, a computer-readable medium includes a compact disc read-only memory (CD-ROM) as is known in the art for storing software. The computer-readable medium is accessed by a processor suitable for executing instructions configured to be executed.

Nucleic Acid Sequencing Platforms

Nucleic acid sequence data can be generated using various techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, nanopore based systems, etc.

Various embodiments of nucleic acid sequencing platforms, such as a nucleic acid sequencer, can include components as displayed in the block diagram of FIG. 2. According to various embodiments, sequencing instrument 200 can include a fluidics delivery and control unit 202, a sample processing unit 204, a signal detection unit 206, and a data acquisition, analysis and control unit 208. Various embodiments of instrumentation, reagents, libraries and methods used for next generation sequencing are described in U.S. Patent Application Publication No. 2009/0127589 and No. 2009/0026082, which are incorporated herein by reference. Various embodiments of instrument 200 can provide for automated sequencing that can be used to gather sequence information from a plurality of sequences in parallel, such as substantially simultaneously.

In various embodiments, the fluidics delivery and control unit 202 can include a reagent delivery system. The reagent delivery system can include a reagent reservoir for the storage of various reagents. The reagents can include nucleotide tri-phosphates (such as ATP, CTP, GTP, TTP), RNA-based primers, forward/reverse DNA primers, oligonucleotide mixtures for ligation sequencing, nucleotide mixtures for sequencing-by-synthesis, buffers, wash reagents, blocking reagents, stripping reagents, and the like. Additionally, the reagent delivery system can include a pipetting system or a continuous flow system which connects the sample processing unit with the reagent reservoir.

In various embodiments, the sample processing unit 204 can include a sample chamber, such as a flow cell, a substrate, a micro-array, a multi-well tray, or the like. The sample processing unit 204 can include multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously. Additionally, the sample processing unit can include multiple sample chambers to enable processing of multiple runs simultaneously. In particular embodiments, the system can perform signal detection on one sample chamber while substantially simultaneously processing another sample chamber. Additionally, the sample processing unit can include an automation system for moving or manipulating the sample chamber.

In various embodiments, the signal detection unit 206 can include an imaging or detection sensor. For example, the imaging or detection sensor can include a CCD, a CMOS, an ion or chemical sensor, such as an ion sensitive layer overlying a CMOS or FET, a current or voltage detector, or the like. The signal detection unit 206 can include an excitation system to cause a probe, such as a fluorescent dye, to emit a signal. The excitation system can include an illumination source, such as an arc lamp, a laser, a light emitting diode (LED), or the like. In particular embodiments, the signal detection unit 206 can include optics for the transmission of light from an illumination source to the sample or from the sample to the imaging or detection sensor. Alternatively, the signal detection unit 206 may provide for electronic or non-photon based methods for detection and consequently not include an illumination source. In various embodiments, electronic-based signal detection may occur when a detectable signal or species is produced during a sequencing reaction. For example, a signal can be produced by the interaction of a released byproduct or moiety, such as a released ion, such as a hydrogen ion, interacting with an ion or chemical sensitive layer. In other embodiments, a detectable signal may arise as a result of an enzymatic cascade such as used in pyrosequencing (see, for example, U.S. Patent Application Publication No. 2009/0325145, the entirety of which is incorporated herein by reference) where pyrophosphate is generated through base incorporation by a polymerase which further reacts with ATP sulfurylase to generate ATP in the presence of adenosine 5′ phosphosulfate wherein the ATP generated may be consumed in a luciferase mediated reaction to generate a chemiluminescent signal. In another example, changes in an electrical current can be detected as a nucleic acid passes through a nanopore without the need for an illumination source.

In various embodiments, a data acquisition, analysis and control unit 208 can monitor various system parameters. The system parameters can include temperature of various portions of instrument 200, such as sample processing unit or reagent reservoirs, volumes of various reagents, the status of various system subcomponents, such as a manipulator, a stepper motor, a pump, or the like, or any combination thereof.

It will be appreciated by one skilled in the art that various embodiments of instrument 200 can be used to practice a variety of sequencing methods including ligation-based methods, sequencing by synthesis, single molecule methods, nanopore sequencing, and other sequencing techniques.

In various embodiments, the sequencing instrument 200 can determine the sequence of a nucleic acid, such as a polynucleotide or an oligonucleotide. The nucleic acid can include DNA or RNA, and can be single stranded, such as ssDNA and RNA, or double stranded, such as dsDNA or an RNA/cDNA pair. In various embodiments, the nucleic acid can include or be derived from a fragment library, a mate pair library, a ChIP fragment, or the like. In particular embodiments, the sequencing instrument 200 can obtain the sequence information from a single nucleic acid molecule or from a group of substantially identical nucleic acid molecules.

In various embodiments, sequencing instrument 200 can output nucleic acid sequencing read data in a variety of different output data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.

In various embodiments, bi-directional sequencing of a fragment can be performed by sequencing the fragment in both directions. For example, the sequencing instrument 200 can determine a first (forward) read of a nucleic acid while a forward primer can be extended in a forward direction along a single stranded template. The template can be prepared for a second (reverse) read. For example, the forward primer can be extended the full length of the template, and the template can be nicked and degraded to leave a portion that can act as a primer for the reverse read. The sequencing instrument 200 can determine the second (reverse) read while the reverse primer can be extended in the opposite direction along the extended forward primer. For example, bidirectional sequencing is described in more detail in co-pending U.S. application Ser. No. 13/543,521, filed Jul. 6, 2012 and titled “Sequencing Methods and Compositions”, which is incorporated by reference in its entirety.

Flow Space

FIG. 3 shows an exemplary ionogram representation of signals from which base calls may be made. In this example, the x-axis shows the nucleotide that is flowed and the corresponding number of nucleotide incorporations may be estimated by rounding to the nearest integer shown in the y-axis, for example. Signals used to make base calls and determine a flowspace vector may be from any suitable point in the acquisition or processing of the data signals received from sequencing operations. For example, the signals may be raw acquisition data or data having been processed, such as, e.g., by background filtering, normalization, correction for signal decay, and/or correction for phase errors or effects, etc. The base calls may be made by analyzing any suitable signal characteristics (e.g., signal amplitude or intensity).
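As a minimal sketch of the rounding step described above, the following Python function estimates the number of incorporations per flow from normalized signal amplitudes. The function name and the example signal values are hypothetical and are not taken from FIG. 3; real base callers typically apply the normalization and phase corrections mentioned above before any such rounding.

```python
def estimate_incorporations(signals):
    """Round each normalized flow signal to the nearest non-negative integer.

    Hypothetical helper for illustration. Note that Python's round() sends
    exact .5 ties to the nearest even integer.
    """
    return [max(0, int(round(s))) for s in signals]


# Made-up normalized amplitudes for five consecutive flows.
signals = [1.04, 0.03, 0.96, 2.10, -0.05]
print(estimate_incorporations(signals))  # [1, 0, 1, 2, 0]
```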

In various embodiments, output signals due to nucleotide incorporation may be processed in various ways to improve their quality and/or signal-to-noise ratio, which may include performing or implementing one or more of the teachings disclosed in Rearick et al., U.S. patent application Ser. No. 13/339,846, filed Dec. 29, 2011, and in Hubbell, U.S. patent application Ser. No. 13/339,753, filed Dec. 29, 2011, which are all incorporated by reference herein in their entirety.

In various embodiments, output signals due to nucleotide incorporation may be further processed, given knowledge of what nucleotide species were flowed and in what order to obtain such signals, to make base calls for the flows and compile consecutive base calls associated with a sample nucleic acid template into a read. A base call refers to a particular nucleotide identification (e.g., dATP (“A”), dCTP (“C”), dGTP (“G”), or dTTP (“T”)). Base calling may include performing one or more signal normalizations, signal phase and signal droop (e.g., enzyme efficiency loss) estimations, and signal corrections, and may identify or estimate base calls for each flow for each defined space. Base calling may include performing or implementing one or more of the teachings disclosed in Davey et al., U.S. patent application Ser. No. 13/283,320, filed Oct. 27, 2011, which is incorporated by reference herein in its entirety. Other aspects of signal processing and base calling may include performing or implementing one or more of the teachings disclosed in Davey et al., U.S. patent application Ser. No. 13/340,490, filed on Dec. 29, 2011, and Sikora et al., U.S. patent application Ser. No. 13/588,408, filed on Aug. 17, 2012, which are all incorporated by reference herein in their entirety.

FIGS. 4 and 5 demonstrate a relationship between a base space sequence and a flowspace vector. A series of signals representative of a number of incorporations or lack thereof (e.g., 0-mer, 1-mer, 2-mer, etc.) produced by a series of nucleotide flows may be referred to as a flowspace vector or sequence, as opposed to a base space sequence, which is simply the order of identified nucleotide bases in a nucleic acid of interest. The flowspace vector may be produced using any suitable nucleotide flow ordering, including a predetermined ordering based on a cyclical, repeating pattern of consecutive repeats of a predetermined reagent flow ordering, based on a random reagent flow ordering, or based on an ordering comprising in whole or in part a phase-protecting reagent flow ordering as described in Hubbell et al., U.S. patent application Ser. No. 13/440,849, filed Apr. 5, 2012, or some combination thereof. In FIGS. 4 and 5, an exemplary base space sequence AGTCCA is subjected to sequencing operations using a cyclical flow ordering of TACG (that is, a T nucleotide flow, followed by an A nucleotide flow, followed by a C nucleotide flow, followed by a G nucleotide flow, and this 4-flow ordering is then repeated cyclically). The flows result in a series of signals having an amplitude (e.g., signal intensity) related to the number of nucleotide incorporations (e.g., 0-mer, 1-mer, 2-mer, etc.). This series of signals generates the flowspace vector 101001021. As shown in FIG. 4, the base space sequence AGTCCA may be translated to a flowspace vector 101001021 under a cyclical flow ordering of TACG. The flowspace vector may change if the flow ordering is changed. As shown in FIG. 5, the flowspace vector may be mapped back to the base space sequence associated with the sample.
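The mapping from a flowspace vector back to base space can be sketched as follows. This is an illustrative sketch only, with hypothetical names. It rests on one assumption made to reproduce the stated vector: the listed sequence AGTCCA is treated as the strand against which the flowed nucleotides incorporate, taken base by base from left to right, so the incorporated bases are its complements (strand orientation is ignored here).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flows_to_bases(flow_vector, flow_order):
    """Expand per-flow incorporation counts into the incorporated bases.

    Hypothetical helper; flow_order is applied cyclically.
    """
    bases = []
    for i, count in enumerate(flow_vector):
        nuc = flow_order[i % len(flow_order)]
        bases.append(nuc * count)  # a count of 2 yields a 2-mer such as "GG"
    return "".join(bases)


flow_vector = [1, 0, 1, 0, 0, 1, 0, 2, 1]  # the vector 101001021
incorporated = flows_to_bases(flow_vector, "TACG")
print(incorporated)  # TCAGGT

# Assumption: complementing base by base recovers the listed sequence.
print("".join(COMPLEMENT[b] for b in incorporated))  # AGTCCA
```

Changing the flow order passed to flows_to_bases changes the decoded bases for the same vector, which is consistent with the statement that the flowspace vector depends on the flow ordering.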

System and Methods for Analyzing Paired End Read Information

FIG. 6 is an exemplary flow diagram showing a method 600 for analyzing paired end read information, in accordance with various embodiments. In various exemplary embodiments, multiple overlapping reads of the same polynucleotide can be used to correct sequencing errors in one of the reads, thereby improving the accuracy of and confidence in the resulting sequence.

At 602, paired end read information can be obtained. In various exemplary embodiments, the paired end read information can be obtained by sequencing a polynucleotide from opposite directions, such as from 5′→3′ and 3′→5′.

At 604, the paired end reads can be aligned to identify the overlapping regions of the reads. In various embodiments, the alignment can be performed in base space, flow space, color space, or other representations of the information received from a sequencing system. In particular embodiments, aligning the paired end reads in a signal space (such as flow space or color space) can prove to be advantageous. Specifically, by utilizing the signal information, combining information from ambiguous signals can provide sufficient evidence to confidently call the base sequence. For example, for a homopolymer stretch, flow space information from a first read may ambiguously indicate the homopolymer length as 4 to 5 bases, whereas flow space information from the paired second read may ambiguously indicate the homopolymer length as 3 to 4 bases. Whereas use of only base space information could result in an indication of a homopolymer length of 3 to 5 bases, using the signal space information can provide sufficient evidence to identify the homopolymer length as 4 bases.
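The homopolymer example above can be illustrated with a short Python sketch. The signal values 4.5 and 3.5 are hypothetical stand-ins for "4 to 5 bases" and "3 to 4 bases", and the unweighted mean is an assumption made for illustration rather than the combination rule of any particular embodiment.

```python
import math


def candidate_lengths(signal, tolerance=0.5):
    """Homopolymer lengths consistent with one normalized signal on its own."""
    low = max(0, math.ceil(signal - tolerance))
    high = math.floor(signal + tolerance)
    return list(range(low, high + 1))


def combined_length(signal_a, signal_b):
    """Call a length from the mean of two signals (hypothetical rule)."""
    mean_signal = (signal_a + signal_b) / 2.0
    return int(mean_signal + 0.5)  # round half up


first, second = 4.5, 3.5  # hypothetical normalized flow signals
base_space = sorted(set(candidate_lengths(first)) | set(candidate_lengths(second)))
print(base_space)  # [3, 4, 5]: each read alone leaves 3 to 5 bases possible
print(combined_length(first, second))  # 4
```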

At 606, a position can be selected in the overlapping region of the paired end reads. At 608, the paired end reads can be compared at the selected position to determine if the paired end reads are concordant, such that the paired end reads indicate agreement at the position.

At 610, when the paired end reads are concordant at a position, such as for example both reads provide evidence for C at the selected position, a call can be made for a consensus read at the selected position. At 612, a concordant quality value can be calculated for the selected position. In various embodiments, the concordant quality value can be greater than a quality value determined for a base call based on either read individually.

At 614, a determination can be made as to whether there are additional positions in the overlapping region of the paired end reads. When there are additional positions, another position can be selected at 606. Alternatively, when there are no additional positions, the sequence calls and quality values can be provided, as indicated at 616.

Returning to 608, when the paired reads are not concordant, a weighted average of the signal information can be calculated for the position, as indicated at 618. In various embodiments, the accuracy of the reads may be different at a particular position. For example, the accuracy of the reads may decrease along the length of the read, such as due to carry forward and incomplete extension errors. As such, a signal from a read at a position closer to the beginning can be weighted more than a signal from a read at a position closer to the end. Additionally, as the sequence context may be different for the paired end reads, modeling of context dependent errors can be used to influence the weighting of the signals of the paired end reads.

At 620, a call may be made for the position based on the weighted average signal, and at 622, a discordant quality value can be determined. Discordant reads provide evidence that a sequencing error occurred at the position for at least one of the reads. The call can be made based on assumptions as to which read is less likely to be in error, but it may not be possible to tell in which read the sequencing error occurred. Thus, in various embodiments, the discordant quality value may be lower than a quality value determined for an individual read.
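The per-position logic of method 600 (steps 606 through 622) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Observation` type, the function names, and in particular both quality formulas are hypothetical. The text only requires that the concordant quality value can exceed either individual value and that the discordant quality value may be lower than an individual value.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    signal: float   # normalized flow signal (1.0 is roughly one incorporation)
    weight: float   # relative trust in this read at this position
    quality: float  # Phred-scaled quality of the individual call


def _round_half_up(value):
    return int(value + 0.5)


def call_position(first, second, max_quality=60.0):
    """Return (call, quality value) for one aligned position."""
    call_first = _round_half_up(first.signal)
    call_second = _round_half_up(second.signal)
    if call_first == call_second:
        # Steps 610/612: concordant. Hypothetical rule: add the Phred values
        # (independent errors), capped at max_quality.
        return call_first, min(first.quality + second.quality, max_quality)
    # Steps 618-622: discordant. Call from the weighted average signal.
    total = first.weight + second.weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean_signal = (first.signal * first.weight + second.signal * second.weight) / total
    # Hypothetical rule: penalize below the weaker individual quality value.
    return _round_half_up(mean_signal), 0.5 * min(first.quality, second.quality)


def consensus(first_read, second_read):
    """Step 614 loop: visit every position in the overlapping region."""
    return [call_position(a, b) for a, b in zip(first_read, second_read)]


calls = consensus(
    [Observation(2.1, 1.0, 20.0), Observation(4.4, 3.0, 20.0)],
    [Observation(1.9, 1.0, 20.0), Observation(5.6, 1.0, 15.0)],
)
print(calls)
```

In this usage example the first position is concordant and the second is discordant, so the second call comes from the weighted average and carries a reduced quality value.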

FIG. 7 is an exemplary flow diagram illustrating a method of aligning flow series information for paired end reads, in accordance with various embodiments. A flow series includes information about the signals generated or observed by a sequencing instrument when a sample polynucleotide is exposed to a series of nucleotide flows, such as under conditions to allow synthesis of a complementary polynucleotide. As the series of nucleotide flows may not precisely match the order of nucleotides incorporated into the complementary polynucleotide, the flow series can contain “empty flows” indicative of a nucleotide flow that did not result in an incorporation of the nucleotide. Additionally, for a homopolymer stretch, a single nucleotide flow may result in multiple incorporations and result in a signal proportional to the number of incorporations. Further, the order of the nucleotide flows, or “flow order”, may vary, such that, for example, an ‘A’ is not always flowed after a ‘C’. These factors can make the flow series information highly dependent on both the sequence context and the starting point for the read. Specifically, two reads of the same polynucleotide may have different numbers of empty flows between incorporation events depending on the starting point of the two reads. Thus, there may not be a 1:1 mapping of the flows for the two reads. Additionally, when the two reads are performed in opposite directions, such as 5′→3′ and 3′→5′, the sequence context for the two reads can be different.

At 702, flow series information can be obtained for paired end reads. In various exemplary embodiments, the flow series information can be obtained by sequencing a polynucleotide or the flow series information can be provided as a data file from a sequencing instrument.

At 704, initial base sequences for the paired end reads can be obtained. In various embodiments, the initial base sequences for the paired end reads can be provided as a file, such as along with the flow series information or in a separate file, or can be determined based on the flow series information. Various methods are known in the art for determining an initial base call; for example, a method of determining an initial base sequence is described in U.S. patent application Ser. No. 13/340,490, the entirety of which is incorporated herein by reference.

At 706, an initial alignment of the paired end reads can be determined using the base sequence for the reads. In various embodiments, the initial alignment can be limited to only a portion of the paired base sequence, such as not more than 100 bases, such as not more than 50 bases, even not more than 20 bases, or such as not more than one half of the base sequence length, such as not more than one fourth of the base sequence length, even not more than one tenth of the base sequence length. In various embodiments, as the paired end reads represent sequence data from the same polynucleotide, corresponding reads should be substantially concordant, only differing in positions where a sequencing error has occurred in at least one of the paired end reads. As such, the initial alignment may be a perfect match, such that over the length of the aligned portion no mismatches are present.
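One simple way to obtain a perfect-match initial alignment over a limited portion of the reads is an exact seed search, sketched below. The function names and the default seed length of 20 are hypothetical. The sketch also assumes that both reads are reported 5′→3′ on opposite strands, so that the reverse complement of the second read begins inside the first read; other read orientations would need a different search.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def seed_alignment(first_read, second_read, seed_length=20):
    """Find a perfect-match seed between two paired end reads.

    Returns the offset in first_read at which the reverse complement of
    second_read starts, or None when no mismatch-free seed is found.
    Only the first seed_length bases are compared, so a short seed may
    match at more than one place; the first match is returned.
    """
    oriented = reverse_complement(second_read)
    seed = oriented[:seed_length]
    if not seed:
        return None
    offset = first_read.find(seed)
    return offset if offset >= 0 else None


fragment = "ACGTTGCAAGGCTTACG"                   # hypothetical polynucleotide
first_read = fragment[:12]                       # read from one end
second_read = reverse_complement(fragment[5:])   # read from the other end
print(seed_alignment(first_read, second_read, seed_length=5))  # 5
```

A read pair for which `seed_alignment` returns `None` has no mismatch-free seed under these assumptions and could be set aside, as discussed for unaligned reads elsewhere in this description.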

At 708, the flow series information for the reads can be mapped to the aligned base sequence. For example, the non-empty flows that correspond to the aligned base calls can be mapped to the aligned positions while empty flows in between the non-empty flows can be mapped between the positions. Where the empty flows of the flow series correspond, the empty flows can also be aligned.

By way of example, given an aligned sequence of CTG (corresponding to the complementary CAG in the reverse direction), a base space alignment can be defined as follows.

5′ C T G 3′
3′ G A C 5′

With corresponding flow orders of CAGTCAG and CTATG for the top and bottom reads respectively, the empty flows can be mapped as follows, with the lowercase letters representing the empty flows.

5′ C a g T c a G 3′
3′ G t   A   t C 5′
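The empty-flow mapping shown above can be reproduced with a short Python sketch. The function name `annotate_flows` is hypothetical; the sketch labels each flow in a flow order as an incorporation (uppercase) or an empty flow (lowercase), given the bases of the read in the order in which they were sequenced.

```python
def annotate_flows(bases, flow_order):
    """Label each flow: uppercase for an incorporation, lowercase if empty."""
    labels, pos = [], 0
    for nuc in flow_order:
        run = 0
        # One flow consumes a whole homopolymer run of the flowed nucleotide.
        while pos < len(bases) and bases[pos] == nuc:
            run += 1
            pos += 1
        labels.append(nuc.upper() * run if run else nuc.lower())
    if pos != len(bases):
        raise ValueError("flow order does not account for every base")
    return labels


top = annotate_flows("CTG", "CAGTCAG")   # top read, 5' to 3'
bottom = annotate_flows("CAG", "CTATG")  # bottom read, 5' to 3'
print(" ".join(top))               # C a g T c a G
print(" ".join(reversed(bottom)))  # G t A t C (shown 3' to 5')
```

The sketch only labels the flows of each read. Placing the bottom read's flows in the same columns as the top read's flows, as in the diagram, additionally requires the base space alignment from 706.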

At 710, the flow series alignment can be extended in both directions by aligning non-empty flows with empty flows spaced between. Where there are corresponding empty flows, the empty flows can also be aligned.

FIG. 8 illustrates a flow space alignment for a forward (FWD) and reverse (REV) read of a fragment. For illustrative purposes, the orientation of the FWD flow vector is reversed to align orientations with the REV flow vector, and the positive flows of the FWD flow vector are mapped to the flow order of the REV flow vector. Alignment in flow space of the two reads allows a comparison of the flow values in addition to the base calls. In the example illustrated, flow position 37 of the REV read shows a homopolymer of 5 Ts with a flow value of 528. The corresponding position in the FWD read shows a homopolymer of 4 As with a flow value of 420. As a result, the confidence in the call at this position can be lower than the confidence in other calls where there is a consensus between the FWD and REV reads. Additionally, a consensus call can be made by taking a weighted average of the flow values for the two calls. The value of the REV read may have a higher weighting given that it is earlier in the flow sequence (flow position 37 vs. 184).
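The position-dependent weighting described for FIG. 8 can be sketched as follows, using the flow values and flow positions given above. The exponential weighting, the `decay` constant of 200 flows, and the assumption that a flow value of 100 corresponds to one incorporation are all hypothetical choices made for illustration; the text does not specify a weighting function.

```python
import math


def position_weight(flow_index, decay=200.0):
    """Hypothetical weight that shrinks for flows later in the read."""
    return math.exp(-flow_index / decay)


def weighted_flow_value(observations, decay=200.0):
    """Weighted average of (flow_value, flow_index) observations."""
    weights = [position_weight(index, decay) for _, index in observations]
    total = sum(weights)
    return sum(value * w for (value, _), w in zip(observations, weights)) / total


# REV read: flow value 528 at flow 37; FWD read: flow value 420 at flow 184.
combined = weighted_flow_value([(528, 37), (420, 184)])
call = int(combined / 100 + 0.5)  # assumes 100 units per incorporation
print(round(combined), call)
```

Under these assumptions the combined value lies closer to the REV value than to the FWD value, because the REV observation occurs earlier in its flow sequence, and the resulting call is a homopolymer of length 5.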

FIG. 9 is a flow diagram illustrating an exemplary method of analyzing paired end read information to identify low frequency variants, in accordance with various embodiments.

At 902, paired end reads for a plurality of polynucleotides, such as from the same sample, can be obtained, and at 904, corresponding paired end reads can be aligned.

At 906, polynucleotides can be identified where the paired end reads are aligned. In various embodiments, paired end reads for a portion of the polynucleotides may not be able to be aligned. For example, when the length of the reads is relatively short compared to the length of the polynucleotide, there may be insufficient overlap of the reads. In another example, when there is a large number of sequencing errors for a read, it may not be possible to identify a sufficiently large region with an alignment having no mismatches. Regardless of the reason, polynucleotides where there is not an alignment of the paired end reads can be excluded from further analysis.

At 908, consensus sequences from polynucleotides with aligned reads can be mapped, such as to a reference sequence or to one another, and the sequences can be compared to identify variants, as indicated at 910. Various methods are known in the art for identifying variants; for example, a method of identifying variants is described in U.S. Provisional Patent Application No. 61/584,391, the entirety of which is incorporated herein by reference and included as Exhibit 1.

Generally, where the paired end reads are concordant, there is a high degree of confidence that there are not sequencing errors for the paired end reads at that position. For example, if there is an expected error rate of about 1% (0.01), the expected likelihood of an error occurring in two paired end reads at the same position would be about 0.01% (0.0001). Thus, by analyzing only concordant reads at a position, the expected error rate can be significantly decreased and low frequency variants may be identified with confidence at lower frequencies than with data from the entire sample at a higher expected error rate.
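The arithmetic in the example above can be checked with a few lines of Python. The sketch assumes that errors in the two reads are independent, which is the assumption implicit in multiplying the two error rates, and expresses the result on the standard Phred scale, Q = -10 log10(p). The function names are hypothetical.

```python
import math


def phred(error_probability):
    """Phred-scaled quality value for an error probability."""
    return -10.0 * math.log10(error_probability)


def concordant_error_rate(per_read_error):
    """Probability that both reads are in error at the same position,
    assuming the errors in the two reads are independent."""
    return per_read_error ** 2


per_read = 0.01                            # about 1% expected error rate
both = concordant_error_rate(per_read)     # about 0.0001, i.e. 0.01%
print(both, phred(per_read), phred(both))  # roughly Q20 per read, Q40 combined
```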

FIG. 10 is a schematic diagram of a system for identifying variants, in accordance with various embodiments.

As depicted herein, sequence analysis system 1000 can include a nucleic acid sequence analysis device 1004 (e.g., nucleic acid sequencer, real-time/digital/quantitative PCR instrument, microarray scanner, etc.), a sequence analytics computing server/node/device 1002, and a display 1010 and/or a client device terminal 1008.

In various embodiments, the sequence analytics computing server/node/device 1002 can be communicatively connected to the nucleic acid sequence analysis device 1004 and client device terminal 1008 via a network connection 1024 that can be either a “hardwired” physical network connection (e.g., Internet, LAN, WAN, VPN, etc.) or a wireless network connection (e.g., Wi-Fi, WLAN, etc.).

In various embodiments, the sequence analytics computing device/server/node 1002 can be a workstation, mainframe computer, distributed computing node (such as part of a “cloud computing” or distributed networking system), personal computer, mobile device, etc. In various embodiments, the nucleic acid sequence analysis device 1004 can be a nucleic acid sequencer, real-time/digital/quantitative PCR instrument, microarray scanner, etc. It should be understood, however, that the nucleic acid sequence analysis device 1004 can essentially be any type of instrument that can generate nucleic acid sequence data from samples obtained from an individual.

The sequence analytics computing server/node/device 1002 can be configured to host an optional pre-processing module 1012 and a paired end analysis module 1016.

Pre-processing module 1012 can be configured to receive information from the nucleic acid sequence analysis device 1004 and perform preprocessing steps, such as conversion from f space to base space, color space to base space, or from flow space to base space, determining initial call quality values, preparing the read data for use by the paired end analysis module 1016, and the like.

The paired end analysis module 1016 can include a paired read alignment engine 1014, a sequence calling engine 1018, a scoring engine 1020, and an optional post processing engine 1022. In various embodiments, paired end analysis module 1016 can be in communication with the pre-processing module 1012. That is, the paired end analysis module 1016 can request and receive data and information (through, e.g., data streams, data files, text files, etc.) from pre-processing module 1012.

The paired read alignment engine 1014 can be configured to receive paired end reads from the pre-processing module 1012, align the paired end reads, and provide the aligned paired end reads to the sequence calling engine 1018.

In various embodiments, the alignment of the paired end reads can include a limited number of mismatches between a first paired end read and a second paired end read. Generally, a portion of the first paired end read sequence can be aligned to a portion of the second paired end read sequence in order to minimize the number of mismatches between the first and second paired end read sequences.

The sequence calling engine 1018 can be configured to receive aligned paired end reads from the paired read alignment engine 1014, analyze the alignments to identify concordant and discordant positions, determine consensus base calls for the aligned portions, and provide the calls and signal information to the scoring engine 1020.

Scoring engine 1020 can be configured to receive the calls and signal information from the sequence calling engine 1018, and determine a quality value for the calls. The quality value can represent a likelihood that the call accurately represents the sequence of the polynucleotide at the position and can be based on the signal information for the paired end reads and the agreement between the paired end reads.

Post processing engine 1022 can be configured to receive the called sequence and quality values and perform additional processing steps. For example, the post processing engine 1022 may filter the reads, such as by selecting only aligned portions of the paired end reads and discarding unaligned portions or reads that were not found to align or overlap with a corresponding paired end read. Further, the post processing engine 1022 can format the sequence data for display on display 1010 or use by client device 1008.

Client device 1008 can be a thin client or thick client computing device. In various embodiments, client terminal 1008 can have a web browser (e.g., INTERNET EXPLORER™, FIREFOX™, SAFARI™, etc.) that can be used to communicate information to and/or control the operation of the pre-processing module 1012, paired read alignment engine 1014, sequence calling engine 1018, scoring engine 1020, and post processing engine 1022. For example, the client terminal 1008 can be used to configure the operating parameters (e.g., match scoring parameters, annotations parameters, filtering parameters, data security and retention parameters, etc.) of the various modules, depending on the requirements of the particular application. Similarly, client terminal 1008 can also be configured to display the results of the analysis performed by the paired end analysis module 1016 and the nucleic acid sequence analysis device 1004.

It should be understood that the various data stores disclosed as part of system 1000 can represent hardware-based storage devices (e.g., hard drive, flash memory, RAM, ROM, network attached storage, etc.) or instantiations of a database stored on a standalone or networked computing device(s).

It should also be appreciated that the various data stores and modules/engines shown as being part of the system 1000 can be combined or collapsed into a single module/engine/data store, depending on the requirements of the particular application or system architecture. Moreover, in various embodiments, the system 1000 can comprise additional modules, engines, components or data stores as needed by the particular application or system architecture.

In various embodiments, the system 1000 can be configured to process the nucleic acid reads in color space. In various embodiments, system 1000 can be configured to process the nucleic acid reads in base space. In various embodiments, system 1000 can be configured to process the nucleic acid sequence reads in flow space. It should be understood, however, that the system 1000 disclosed herein can process or analyze nucleic acid sequence data in any schema or format as long as the schema or format can convey the base identity and position of the nucleic acid sequence.

In various embodiments, the methods of the present teachings may be implemented in a software program and applications written in conventional programming languages such as C, C++, etc.

While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.

Further, in describing various embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.

The embodiments described herein can be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.

It should also be understood that the embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms such as producing, identifying, determining, or comparing.

Any of the operations that form part of the embodiments described herein are useful machine operations. The embodiments described herein also relate to a device or an apparatus for performing these operations. The device or apparatus can be specially constructed for the required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.

Certain embodiments can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data, which can thereafter be read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

1. A method of analyzing overlapping sequence information, comprising: obtaining first and second sequence signal information for a polynucleotide, the first and second sequence signal information derived from sequencing at least partially overlapping regions of the polynucleotide; aligning at least a portion of the first and second sequence signal information; determining, with a processor, a degree of agreement between the first and second sequence signal information for a location along the polynucleotide; determining, with a processor, a weighted average signal for the location based on the first and second sequence signal information; and determining, with a processor, a base call and quality value for the location based on the weighted average signal and the degree of agreement.
2. The method of claim 1, wherein obtaining the first and second sequence signal information includes sequencing a target nucleic acid while extending a first primer in a first direction, and sequencing the target nucleic acid while extending a second primer in a second direction opposite the first direction.
3. The method of claim 1, wherein the sequence signal information is flow space information.
4. The method of claim 1, wherein the weighted average signal is determined in part based on the relative locations within the first and second sequence signal information.
5. The method of claim 1, wherein the quality value is higher for a location with a higher degree of agreement than for a location with a lower degree of agreement.
6. The method of claim 1, wherein the degree of agreement is based on an agreement between an initial base call from the first sequence signal information and an initial base call from the second sequence signal information.
7. A system for analyzing overlapping sequence information, comprising: a processor configured to: obtain first and second sequence signal information for a polynucleotide, the first and second sequence signal information derived from sequencing at least partially overlapping regions of the polynucleotide; align at least a portion of the first and second sequence signal information; determine a degree of agreement between the first and second sequence signal information for a location along the polynucleotide; determine a weighted average signal for the location based on the first and second sequence signal information; and determine a base call and quality value for the location based on the weighted average signal and the degree of agreement.
8. The system of claim 7, wherein the sequence signal information is flow space information.
9. The system of claim 7, wherein the processor is configured to determine the weighted average signal at least in part based on the relative locations within the first and second sequence signal information.
10. The system of claim 7, wherein the processor is configured to determine a higher quality value for a location with a higher degree of agreement than for a location with a lower degree of agreement.
11. The system of claim 7, wherein the degree of agreement is based on an agreement between an initial base call from the first sequence signal information and an initial base call from the second sequence signal information.
12-16. (canceled)
17. A method of analyzing overlapping sequence information, comprising: sequencing a target nucleic acid in a first direction to obtain a first sequence signal information; sequencing the target nucleic acid in a second direction to obtain a second sequence signal information; aligning at least a portion of the first and second sequence signal information; determining a degree of agreement between the first and second sequence signal information for a location along the polynucleotide; determining a weighted average signal for the location based on the first and second sequence signal information; and determining a base call and quality value for the location based on the weighted average signal and the degree of agreement.
18. The method of claim 17, wherein sequencing in the first direction includes extending a first primer in the first direction.
19. The method of claim 18, further comprising removing a portion of the target nucleic acid after sequencing in the first direction, leaving the extended first primer and a portion of the target nucleic acid to act as a second primer.
20. The method of claim 19, wherein sequencing in the second direction includes extending the second primer in the second direction.
21. The method of claim 17, wherein the sequence signal information is flow space information.
22. The method of claim 17, wherein the weighted average signal is determined in part based on the relative locations within the first and second sequence signal information.
23. The method of claim 17, wherein the quality value is higher for a location with a higher degree of agreement than for a location with a lower degree of agreement.
24. The method of claim 17, wherein the degree of agreement is based on an agreement between an initial base call from the first sequence signal information and an initial base call from the second sequence signal information.